Mesenchymal stromal cells (MSCs) are a subset non-hematopoietic adult stem cells that originate from the mesoderm. They have the capacity for self-renewal and differentiation into osteoblasts, adipocytes and chondrocytes making them important for the development of cell-based therapies in regenerative medicine. However, populations of MSCs are heterogeneous with respect to their differentiation capacity (Costa et al. 2020; Phinney 2012) and some exhibit immundomodulatory properties unhelpful for cell-based therapies. Understanding subtype heterogeneity is key for the development of efficacious therapies. The Genever Group has developed a model of this heterogeneity using has a number of immortalised clonal MSC lines which comprises five subtypes (Stone et al. 2019; Kay et al. 2019). The phenotypes of these subtypes is as follows:
The data are mass spectrometry data of the soluble protein fraction from five immortalised mesenchymal stromal cell (MSC) lines.
There are two data files: Y101_Y102_Y201_Y202_Y101-5.csv and comparison_p_and_q.csv
The data in Y101_Y102_Y201_Y202_Y101-5.csv are normalised protein abundances. Each row is a protein and there are three replicates for each subtype which are arranged in columns. Also in the file are columns for:
1::, bovine with 2::The data in comparison_p_and_q.csv give the p and q values for pairwise comparisons between cell line.
I used R (R Core Team 2019) with tidyverse packages (Wickham 2017) to import and process the raw data. Bovine serum proteins from the medium on which the cells were grown and those for which fewer than two peptides were detected were filtered out. Abundances were summarised for each line and for each line-protein combination and log fold change between cell lines for each protein were calculated. See Equation (2.1). Protein sets were visualised with the VennDiagram package (Chen 2018).
\[\begin{equation} FC_{jj'} = log_{2}\left(\frac{mean_{j}}{mean_{j'}}\right) \tag{2.1} \end{equation}\]
where:
Principal Components Analysis was used for dimensionality-reduction to allow visualisation the distance between samples. It was conducted on protein abundances scaled to unit variance to prevent highly abundant proteins dominating the analysis. The package GGally (Schloerke et al. 2020) was used to produced pairwise scatter plots of the first six principal components.
Hierarchical clustering was conducted and heatmaps produced with the package heatmaply (Galili et al. 2017) with interactivity provided by plotly (Sievert 2020). Hierarchical clustering used the complete linkage method and was performed for both proteins and samples on the proteins that differed significantly between at least two lines. Significance was determined at the \(q < 0.01\) level where \(q\) is the False Discovery Rate (FDR) adjusted \(p\)-value. FDR is a method of correcting for multiple comparisons by the Benjamini–Hochberg procedure (Benjamini and Hochberg 1995).
The abundances were “\(z\)-score normalised” for each protein before clustering and heatmapping. \(Z\)-score normalisation is way to make comparisons between, and visualisations of, abundances on different scales. See Equation (2.2).
\[\begin{equation} z_{ijk} = \frac{x_{ijk} - \bar{x_{ij}}}{sd_{ij}} \tag{2.2} \end{equation}\]
where:
There were 861 human protein identified from more than two peptides All 861 were found in all five cells line with two exceptions: DKK1 was not seen in Y201 and PAPPA2 was not seen in Y202. These are likely to be missing by chance from all three replicates rather than biologically meaningful. See figure 3.1. The total soluble protein content of each cell line was broadly similar. See figure 3.2.
Figure 3.1: Protein set shared by cells lines. Created with package VennDiagram (Chen 2018)
Figure 3.2: Mean soluble protein content for each cell line. Error bars are \(\bar{x} \pm 1 s.e.\)
The first Principal Component captured 30.6% of the variation in soluble protein expression between samples and the first six captured 81.6%. The distributions of scores on PC1 for each cell lineage show good separation between Y1015 and the other cell lines. The two immunomodulatory lines, Y102 and Y202 are the least easy to separate on a single component. Pairwise scatter plots of PC1 to PC6 show greater, but still imperfect, separation of lines. See figure 3.3.
Figure 3.3: Samples represented on the first six Principal Components capturing 81.6% of the variation in soluble protein expression between samples. The leading diagonal shows the distribution of scores on each component with pairwise scatter plots of Principal Components below the diagonal. Created with the GGally (Schloerke et al. 2020) and patchwork (Pedersen 2020) packages.
There were 327 proteins with \(p < 0.05\) and 190 proteins with \(p < 0.01\) but 861 tests implies 43.05 and 8.61 respectively are false positives. The FDR adjusted \(p\)-value (or \(q\)-value) of 0.01 implies that 1% of significant tests will result in false positives. The abundance of 335 proteins differed significantly between at least two cell lines at the \(q < 0.05\) level and between 127 proteins at the \(q < 0.01\) level. This implies only 16.75 and 1.27 false positives.
The distribution of \(q\)-values was right skewed (see Figure 3.4) with a mean of 0.0917 and a median of 0.0745. This what we would expect to see.
Figure 3.4: Distribution of \(q\)-values (\(p\)-values corrected for multiple comparisons by the Benjamini–Hochberg procedure (Benjamini and Hochberg 1995)).
to be completed Cell line replicates clustered by protein expression. These clustered corresponded The two immunomodulatory (see Figure 3.5) foorm one clus
Figure 3.5: my lovely heatmap
Word count calculated with wordcountaddin (Marwick 2020). The session information was written into a separate file rather than to the README this has been added to the wordcount.
This document: 1278
README: 103
Session info: 467
Total: 1848